Changes towards an engine that could help in Neurolang #1
base: main
Conversation
commit e5fac1a
Author: Nils Braun <[email protected]>
Date: Sun Feb 7 16:20:55 2021 +0100

    Aggregate improvements and SQL compatibility (dask-contrib#134)

    * A lot of refactoring of the groupby, mainly to include both distinct and null-grouping
    * Test for non-dask aggregations
    * All NaN data needs to go into the same partition (otherwise we cannot sort)
    * Fix compatibility with SQL on null-joins
    * Distinct is not needed, as it is optimized away by Calcite
    * Implement IS NOT DISTINCT
    * Describe new limitations and remove old ones
    * Added compatibility test from fugue
    * Added a test for sorting with multiple partitions and NaNs
    * Stylefix

commit 7273c2d
Author: Nils Braun <[email protected]>
Date: Sun Feb 7 15:34:55 2021 +0100

    Docs improvements (dask-contrib#132)

    * Fixed a bug in function references in docs
    * More details on the dask-sql internals

commit bdc518e
Author: Nils Braun <[email protected]>
Date: Sun Feb 7 14:19:50 2021 +0100

    Fix the fugue dependency (dask-contrib#133)
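The "distinct and null-grouping" work in the first commit relates to a well-known SQL/dataframe mismatch: SQL's GROUP BY collects all NULL keys into a single group, while pandas drops NaN keys by default. A minimal sketch of the difference (a generic pandas illustration, not dask-sql's actual implementation):

```python
import pandas as pd

# SQL places all NULL grouping keys in one group; pandas drops NaN keys
# unless dropna=False is passed, so SQL-compatible aggregation needs
# the non-default behaviour.
df = pd.DataFrame({"key": ["a", None, None, "a"], "value": [1, 2, 3, 4]})

default = df.groupby("key")["value"].sum()             # NaN group dropped
sql_like = df.groupby("key", dropna=False)["value"].sum()  # NaN group kept

print(len(default))   # 1 (only "a")
print(len(sql_like))  # 2 ("a" and the NaN group)
```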
Hi @demianw! Sorry for spying around on your fork :-) I have seen that you have made a lot of changes and added two new LogicalPlan implementations in total, plus the additional materialized table implementation, which I think is quite interesting! I am happy to discuss any further collaboration with you :-)
Hi @nils-braun, in fact @jonasrenault did most of the work. We have extended and adapted the functionality, but very much aimed at our use case, which is a Datalog implementation. Hence the planners might use a different set of rules than those better adapted to SQL semantics. We would be happy to contribute more to dask-sql, but we should define a work plan, as our time availability for the dask-sql part is limited. If you are up for it, we could discuss with @jonasrenault and see what makes sense to include in the main project, and how.
Hi @nils-braun, thanks for the great library :) We have indeed adapted it recently to try to fit it to our specific use case, which is benefiting from Calcite's query optimizer when solving Datalog queries. As @demianw mentioned, we haven't tried to maintain full compatibility between dask-sql and SQL, since this isn't our priority, so some of the changes we've made probably shouldn't be integrated into dask-sql. Here are the main points we worked on:
Great answers and thoughts, both @demianw and @jonasrenault.
That makes a lot of sense, and I think we can start here. There is already one PR open (dask-contrib#135) which just needs a bit of cleanup. I am very happy to help, I just don't want to interfere with your work, so I would be glad to support you (e.g. if you do not want to implement tests for all the edge cases etc.).
I think we have diverged a bit, because this is now also implemented in dask-sql. For "totally arbitrary" join conditions it is unfortunately necessary to do a full cross join and then filter afterwards, but after dask-contrib#148 was fixed, this should at least be a bit more optimized... However, I like your logic to also parse the SqlCast operation.
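The "cross join, then filter" strategy mentioned above applies to join conditions that cannot be expressed as an equi-join, e.g. `a.x < b.y`. A minimal pandas sketch of the idea (dask's `merge` follows the same API; this is an illustration, not dask-sql's internal code):

```python
import pandas as pd

# An inequality join condition like x < y cannot use a hash equi-join,
# so we materialize all |left| * |right| pairs and filter afterwards.
left = pd.DataFrame({"x": [1, 5, 9]})
right = pd.DataFrame({"y": [4, 6]})

crossed = left.merge(right, how="cross")  # 3 * 2 = 6 candidate pairs
result = crossed[crossed["x"] < crossed["y"]].reset_index(drop=True)

print(len(crossed))  # 6
print(len(result))   # 3 pairs satisfy x < y
```

The cost is quadratic in the input sizes, which is why the optimization referenced in dask-contrib#148 matters.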
Ok, I see. Here I think it makes sense that you fork, because
That is really cool, because I have also seen that this is a major performance bottleneck. I already tried to reduce the number of usages of column assignment, but I would be very happy to also get your optimizations in. I haven't checked how "smart" the renaming function is with cycling renames, but apart from that I see no problem.
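On the "cycling renames" concern: pandas' `rename` applies its column mapping simultaneously, so even a swap like `a -> b, b -> a` is safe, unlike a sequence of assign/drop steps. A small sketch (generic pandas behaviour, not the fork's actual renaming function):

```python
import pandas as pd

# rename applies the whole mapping at once, so cyclic mappings do not
# clobber each other the way sequential assignments would.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
swapped = df.rename(columns={"a": "b", "b": "a"})

print(list(swapped.columns))   # ['b', 'a']
print(swapped["a"].tolist())   # [3, 4] -- the old "b" values
```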
Unfortunately, I can definitely agree. Calcite is very good at what it does, it just needs a bit more documentation for library users.
I do not think I can help you much here (I am also just a "user"). The way dask-sql uses materialized queries is by storing the Dask graph as a new table. Because of this, the graph is re-evaluated on every calculation, which means any access to external systems (e.g. files on disk) is done again. However, this does not apply to access to other dask-sql tables (they are "hardcoded" into the graph), so that might or might not be what you need. As a summary, I am really impressed with the amount of work you have put into this and the results. I would be very happy to have the two plugins you implemented in dask-sql, and maybe also the optimization for the
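The distinction above, between storing a lazy graph (re-evaluated on every access) and a truly materialized result (computed once and reused), can be sketched in plain Python with a cached function; the names here are hypothetical, not dask-sql's API:

```python
import functools

# A lazy "view" re-runs its computation on every access; a
# "materialized" one evaluates once and serves the cached result.
calls = {"n": 0}

def expensive_query():
    calls["n"] += 1          # count how often the work is actually done
    return sum(range(10))

@functools.lru_cache(maxsize=None)
def materialized_query():
    return expensive_query()

expensive_query()            # work done
expensive_query()            # work done again (like a stored graph)
materialized_query()         # work done once...
materialized_query()         # ...then served from cache

print(calls["n"])  # 3
```

In dask terms, the analogous step would be persisting the computed partitions instead of only storing the task graph.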
Added